Evaluation in the ARPA Machine Translation Program: 1993 Methodology
Authors
Abstract
In the second year of evaluations of the ARPA HLT Machine Translation (MT) Initiative, methodologies developed and tested in 1992 were applied to the 1993 MT test runs. The current methodology optimizes the inherently subjective judgments on translation accuracy and quality by channeling the judgments of non-translators into many data points, which reflect both the comparison of the research MT systems' performance with that of production MT systems and against the performance of novice translators. This paper discusses the three evaluation methods used in the 1993 evaluation, the results of the evaluations, and preliminary characterizations of the Winter 1994 evaluation, now underway. The efforts under discussion focus on measuring the progress of core MT technology and increasing the sensitivity and portability of MT evaluation methodology.

1. INTRODUCTION

Evaluation of Machine Translation (MT) has proven to be a particularly difficult challenge over the course of its history. As has been noted elsewhere (White et al., 1993), assessment of how well an expression in one language is conveyed in another is loaded with subjective judgments, even when the expressions are translated by professional translators. Among these judgments are the extent to which the information was conveyed accurately, and the extent to which the information conveyed was fluently expressed in the target language. The inherent subjectivity has been noted, and attempts have been made in MT evaluation to use such judgments to best qualitative advantage (e.g., van Slype 1979). The means of capturing judgments into quantifiably useful comparisons among systems have led to legitimate constraints on the range of the evaluation, such as to the scope of the intended end-use (Church and Hovy 1991), or to the effectiveness of the linguistic model (Jordan et al. 1992, Nomura 1992, Gamback et al. 1991). The ARPA MT Initiative encompasses radically different approaches, potential end-uses, and languages. Consequently, the evaluation methodologies developed for it must capture quantifiable judgments from subjectivity, while being relatively unconstrained otherwise. This paper presents the 1993 methodologies and the results of the 1993 MT evaluation. We further discuss the preliminary status of an evaluation now underway that greatly increases the participation of the entire MT community, while refining the sensitivity and portability of the evaluation techniques.

2. MT EVALUATION IN THE ARPA MT INITIATIVE

The mission of the ARPA MT initiative is "to make revolutionary advances in machine translation technology" (Doddington, personal communication). The focus of the investigation is the "core MT technology." This focus tends, ultimately, away from the tools of MT and toward the (fully automatic) central engines. It is well understood that practical MT will always use tools by which humans interact with the algorithms in the translation process. However, the ARPA aim is to concentrate on fully automatic (FA) output in order to assess the viability of radical new approaches. The May-August 1993 evaluation was the second in the continuing series, along with dry runs and pre-tests of particular evaluation methods. In 1992, evaluation methods were built on human testing models. One method employed the same criteria used in the U.S. government to determine the competence of human translators.
The other method was an "SAT"-type evaluation for determining the comprehensibility of English texts translated manually into the test source languages and then back into English. These methods have been replaced by methods which maintain familiarity in terms of human testing, but which are both more sensitive and more portable to other settings and systems. The Fluency, Adequacy, and Comprehension evaluations developed for the 1993 evaluation are described below; system outputs from 1992 were subjected to the 1993 methods, which demonstrated their enhanced sensitivity (White et al., op. cit.). The 1993 evaluation included output from the three research systems, five production systems, and translations from novice translators. Professional translators produced reference translations, against which outputs were compared in the Adequacy evaluation, and which were used as controls in the Comprehension evaluation. The research systems were:
• CANDIDE (IBM Research: French-English (FE)), which produced both FA and human-assisted (HA) outputs.

[Human Language Technology, Plainsboro, 1994]
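To make the idea of channeling many subjective judgments into quantifiable comparisons concrete, the sketch below shows one plausible way per-segment adequacy and fluency ratings from a panel of non-translator judges could be aggregated into per-system figures. It is a minimal illustration under stated assumptions: the 1-to-5 rating scale, the system labels other than CANDIDE, and the average-then-normalize aggregation are invented for the example and are not the published ARPA scoring procedure.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judgments: each row is (system, segment_id, judge_id,
# adequacy, fluency), with ratings on an assumed 1-5 scale. System names
# other than CANDIDE are placeholders, not the actual 1993 participants.
judgments = [
    ("CANDIDE-FA",   "seg01", "j1", 4, 3),
    ("CANDIDE-FA",   "seg01", "j2", 3, 4),
    ("CANDIDE-FA",   "seg02", "j1", 5, 4),
    ("PRODUCTION-1", "seg01", "j1", 4, 4),
    ("PRODUCTION-1", "seg02", "j2", 3, 3),
]

def system_scores(rows):
    """Average each segment over its judges, then average segments per
    system, and map the 1-5 scale onto 0-1 so systems rated on different
    numbers of segments remain comparable."""
    per_segment = defaultdict(lambda: defaultdict(list))
    for system, segment, _judge, adequacy, fluency in rows:
        per_segment[system][segment].append((adequacy, fluency))

    results = {}
    for system, segments in per_segment.items():
        seg_adequacy = [mean(a for a, _ in ratings) for ratings in segments.values()]
        seg_fluency = [mean(f for _, f in ratings) for ratings in segments.values()]
        results[system] = {
            "adequacy": (mean(seg_adequacy) - 1) / 4,  # 1 -> 0.0, 5 -> 1.0
            "fluency": (mean(seg_fluency) - 1) / 4,
        }
    return results

if __name__ == "__main__":
    for system, scores in system_scores(judgments).items():
        print(f"{system}: adequacy={scores['adequacy']:.2f}, fluency={scores['fluency']:.2f}")
```

The point of such an aggregation is only that many individually subjective ratings, collected over many segments and judges, yield stable system-level comparisons; the actual statistics used in the 1993 evaluation are reported in the paper itself.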
Similar resources
Approaches to Black Box MT Evaluation
In the course of four evaluations in the Advanced Research Projects Agency Machine Translation series, evaluation methods have evolved for measuring the core components of a diverse set of systems. This paper describes the methodologies in terms of the most recent evaluation of research and production MT systems, and discusses indications of ways to improve the focus and portability of the eval...
Full text
Overview of the ARPA Human Language Technology Workshop
The HLT workshop provides a forum where researchers can exchange information about very recent technical progress in an informal, highly interactive setting. The scope includes not just speech recognition, speech understanding, text understanding, and machine translation, but also all spoken and written language work (broadly interpreted to include ARPA's TIPSTER, MT, MUC, and TREC programs) ...
Full text
Machine Translation
Machine Translation was one of the declared highlights and focal points of the Human Language Technology Workshop. Machine Translation, or MT for short, has seen a renaissance in recent years, brought about by the availability of faster and more powerful computing, and several decades of advances in speech and language processing. ARPA now sponsors a machine translation initiative and companies...
Full text
The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language
Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages is still under question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...
Full text
Session 4: Machine Translation
About 2½ years ago, ARPA initiated a new program in Machine Translation (MT). Three projects were funded: CANDIDE, built by IBM in New York; LINGSTAT, built by Dragon Systems in Boston; and PANGLOSS, a collaboration of the Computing Research Laboratory at New Mexico State University, the Center for Machine Translation at Carnegie Mellon University, and the Information Sciences Institute...
Full text